Solar power is a central component of the transition to low-carbon energy systems. However, solar generation is highly variable: it depends not only on the time of day and season, but also on short-term weather conditions such as cloud cover, humidity, and temperature. For grid operators and planners, it is important to understand both how weather drives solar output and how uncertain that relationship is.
In this project, I analyze a real-world dataset that combines solar energy production with local weather measurements over time. The goal is to build a hierarchical Bayesian regression model that quantifies how irradiance and weather conditions drive solar output during daylight hours, allows baseline production to vary by month through partially pooled intercepts, and yields full posterior predictive distributions for production under specified weather scenarios.
This work serves as the final project for a graduate course in Bayesian computational statistics. The focus is not only on obtaining point estimates, but on modeling and communicating uncertainty in a principled way.
The dataset, stored as solar_weather.csv, contains time-stamped records of:
- Energy.delta.Wh.: energy produced during each time interval (watt-hours). In the analysis, this variable is renamed to energy_wh for convenience.
- GHI: global horizontal irradiance.
- isSun: indicator for whether the sun is above the horizon (0/1).
- sunlightTime, dayLength, SunlightTime.daylength: measures of available sunlight.
- temp: temperature (°C).
- pressure: air pressure.
- humidity: relative humidity (%).
- wind_speed: wind speed.
- rain_1h, snow_1h: precipitation indicators.
- clouds_all: cloud cover (%).
- weather_type: categorical description of conditions.
- Time: timestamp.
- hour: hour of day.
- month: month of year (1–12).
Each row corresponds to one time interval (here, 15 minutes). At night, isSun and GHI are zero, and solar energy production is essentially zero. During daylight hours, we see positive irradiance and varying levels of solar output.
This project focuses on the following questions:
How do irradiance and weather conditions affect solar energy output during daylight hours? For example, how much additional energy can we expect when GHI increases by one standard deviation, after controlling for temperature and cloud cover?
How do these relationships vary by month? Even under similar instantaneous weather conditions, January and July differ in sun angle and typical atmospheric conditions. A hierarchical model that allows the baseline log-energy to vary by month can capture this.
What does the model predict for solar production under specific future weather scenarios, and how uncertain are those predictions? Instead of a single forecast, a Bayesian model provides a posterior predictive distribution describing a range of plausible outcomes.
A standard regression model could estimate the relationship between log-energy and weather variables by pooling all observations together. However, this ignores systematic seasonal differences. A fully separate model for each month would overfit and fail to share information.
A hierarchical Bayesian model offers a compromise: month-specific intercepts are drawn from a common distribution, so each month can have its own baseline while still borrowing strength from the other months (partial pooling).
This structure is well aligned with the Bayesian perspective discussed in Bayesian Data Analysis (Gelman et al., 2013): we encode exchangeability across months and estimate the degree of variation between them instead of assuming they are either identical or completely unrelated.
The remainder of this report is organized as follows: we first load and prepare the data, explore it visually, specify the hierarchical model and its priors, fit it with brms and check MCMC diagnostics, interpret the posterior estimates and month-level effects, run posterior predictive checks and an example scenario prediction, and close with conclusions, limitations, and directions for future work.
We begin by loading the CSV file and inspecting its basic structure.
library(tidyverse)
solar <- read.csv("solar_weather.csv")
solar <- solar %>%
rename(
energy_wh = Energy.delta.Wh.
)
glimpse(solar)
## Rows: 196,776
## Columns: 17
## $ Time <chr> "2017-01-01 00:00:00", "2017-01-01 00:15:00", "…
## $ energy_wh <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GHI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ temp <dbl> 1.6, 1.6, 1.6, 1.6, 1.7, 1.7, 1.7, 1.7, 1.9, 1.…
## $ pressure <int> 1021, 1021, 1021, 1021, 1020, 1020, 1020, 1020,…
## $ humidity <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ wind_speed <dbl> 4.9, 4.9, 4.9, 4.9, 5.2, 5.2, 5.2, 5.2, 5.5, 5.…
## $ rain_1h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ snow_1h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clouds_all <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ isSun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sunlightTime <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ dayLength <int> 450, 450, 450, 450, 450, 450, 450, 450, 450, 45…
## $ SunlightTime.daylength <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather_type <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ hour <int> 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
summary(solar$energy_wh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 573 577 5020
summary(solar$GHI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 1.6 32.6 46.8 229.2
summary(solar$temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.600 3.600 9.300 9.791 15.700 35.800
The raw dataset includes both night-time and daytime observations. At
night, isSun and GHI are zero, and energy
production is essentially zero. Since our goal is to model the
relationship between weather and solar production, we focus on daylight
intervals with positive irradiance and positive energy output.
solar_day <- solar %>%
filter(
isSun == 1,
GHI > 0,
energy_wh > 0
)
nrow(solar); nrow(solar_day)
## [1] 196776
## [1] 95790
summary(solar_day$energy_wh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 179 617 1177 1969 5020
This filter removes night-time intervals and any records where irradiance or energy output is zero. The remaining data represent times when the system is actively producing power.
The distribution of raw energy values is right-skewed. To stabilize variance and approximate normality, we model the logarithm of energy:
\[ y_i = \log(\text{energy\_wh}_i). \]
We also standardize key continuous predictors (subtract mean and divide by standard deviation) to improve MCMC sampling and interpretability of coefficients.
solar_day <- solar_day %>%
mutate(
log_energy = log(energy_wh),
GHI_scaled = as.numeric(scale(GHI)),
temp_scaled = as.numeric(scale(temp)),
clouds_scaled = as.numeric(scale(clouds_all)),
month_factor = factor(month)
)
summary(solar_day$log_energy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.187 6.425 6.240 7.585 8.521
table(solar_day$month_factor)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 4842 6133 8446 9759 11297 11546 11678 10414 7263 6014 4556 3842
ggplot(solar_day, aes(x = log_energy)) +
geom_histogram(bins = 40, color = "black") +
labs(
title = "Distribution of log(energy_wh)",
x = "log(energy_wh)",
y = "Count"
)
The log-transformed energy is much closer to symmetric, which is more compatible with a Gaussian likelihood.
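To quantify this visual impression, a quick numerical check can be added (a small sketch; the skew() helper below is defined inline rather than taken from any package used in this report):
# Rough moment-based skewness of raw vs. log-transformed energy
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(
  raw_energy = skew(solar_day$energy_wh),
  log_energy = skew(solar_day$log_energy)
)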
ggplot(solar_day, aes(x = GHI, y = energy_wh)) +
geom_point(alpha = 0.2) +
scale_y_log10() +
labs(
title = "Energy vs GHI (log scale on y)",
x = "GHI",
y = "energy_wh (log10 scale)"
)
There is a strong positive association between irradiance and energy production, as expected.
ggplot(solar_day, aes(x = month_factor, y = log_energy)) +
geom_boxplot() +
labs(
title = "log(energy_wh) by Month",
x = "Month",
y = "log(energy_wh)"
)
The boxplots suggest that some months (for example, summer vs. winter) have systematically higher or lower log-energy values, even after filtering to daylight. This visual evidence supports the decision to include month-level random intercepts in a hierarchical model.
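A complementary numeric summary (a small addition, not part of the original analysis) is the per-month mean and spread of log-energy together with the sample sizes, which previews what the hierarchical intercepts should pick up:
# Mean, spread, and sample size of log-energy by month
solar_day %>%
  group_by(month_factor) %>%
  summarise(
    mean_log_energy = mean(log_energy),
    sd_log_energy = sd(log_energy),
    n = n()
  )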
We model the log-energy for each daylight interval \(i\) as:
\[ y_i = \log(\text{energy\_wh}_i) \sim \text{Normal}(\mu_i, \sigma), \]
with linear predictor
\[ \mu_i = \alpha_{j(i)} + \beta_{\text{GHI}} \,\text{GHI\_scaled}_i + \beta_{\text{temp}} \,\text{temp\_scaled}_i + \beta_{\text{clouds}} \,\text{clouds\_scaled}_i, \]
where \(\alpha_{j(i)}\) is the intercept for the month \(j(i)\) to which observation \(i\) belongs, \(\beta_{\text{GHI}}\), \(\beta_{\text{temp}}\), and \(\beta_{\text{clouds}}\) are global regression coefficients shared across months, and \(\sigma\) is the residual standard deviation on the log scale.
We place hierarchical priors on the month-level intercepts:
\[ \alpha_j \sim \text{Normal}(\mu_\alpha, \tau_\alpha^2), \quad j = 1, \ldots, 12. \]
This expresses the idea that months are exchangeable: before seeing the data, we believe they are similar but not identical. The data determine how much they are allowed to differ.
We use weakly informative priors that are broad enough not to dominate the likelihood, but still regularize extreme values: a Normal(0, 5) prior on the population-level intercept \(\mu_\alpha\), Normal(0, 1) priors on the regression coefficients, and half-Student-t(4, 0, 1) priors on both the residual standard deviation \(\sigma\) and the between-month standard deviation \(\tau_\alpha\).
Because the predictors are standardized, a Normal(0, 1) prior on the regression coefficients corresponds roughly to believing that, a priori, a one-standard-deviation change in a predictor is unlikely to shift log-energy by more than a few units.
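Written out, the prior specification used in the fitting code below is (the plus superscript indicates that the Student-t priors on scale parameters are truncated to positive values, as brms does automatically):
\[ \mu_\alpha \sim \text{Normal}(0, 5), \qquad \beta_{\text{GHI}}, \beta_{\text{temp}}, \beta_{\text{clouds}} \sim \text{Normal}(0, 1), \qquad \sigma, \tau_\alpha \sim \text{Student-}t^{+}(4, 0, 1). \]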
We fit the model using the brms package, which provides
a high-level interface to Stan.
library(cmdstanr)
set_cmdstan_path()
options(brms.backend = "cmdstanr")
library(brms)
priors <- c(
prior(normal(0, 5), class = "Intercept"),
prior(normal(0, 1), class = "b"),
prior(student_t(4, 0, 1), class = "sigma"),
prior(student_t(4, 0, 1), class = "sd") # random effects SD
)
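Before fitting, it can be helpful to confirm which parameter classes the model expects priors for; get_prior() lists the default priors and the classes that the custom priors above must match (shown here as a check, using the same formula and data):
# List prior classes (Intercept, b, sd, sigma) for this formula and data
get_prior(
  log_energy ~ GHI_scaled + temp_scaled + clouds_scaled + (1 | month_factor),
  data = solar_day,
  family = gaussian()
)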
fit_hier <- brm(
formula = log_energy ~ GHI_scaled + temp_scaled + clouds_scaled + (1 | month_factor),
data = solar_day,
family = gaussian(),
prior = priors,
chains = 4,
iter = 4000,
warmup = 1000,
cores = 4,
seed = 2025,
control = list(adapt_delta = 0.95)
)
## Running MCMC with 4 parallel chains...
## (per-iteration progress output omitted)
## Chain 3 finished in 3875.7 seconds.
## Chain 2 finished in 4042.5 seconds.
## Chain 1 finished in 4374.5 seconds.
## Chain 4 finished in 4984.6 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 4319.3 seconds.
## Total execution time: 4985.1 seconds.
We check standard MCMC diagnostics to assess convergence:
summary(fit_hier)$fixed
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept 6.25462118 0.083449494 6.08731766 6.41905992 1.002032 1648.328
## GHI_scaled 1.35303200 0.003489750 1.34622030 1.35988964 1.000102 8811.254
## temp_scaled -0.10529976 0.005219221 -0.11540306 -0.09487033 1.000802 7861.817
## clouds_scaled 0.05989996 0.003057899 0.05400022 0.06581612 1.000184 9693.994
## Tail_ESS
## Intercept 2471.083
## GHI_scaled 7756.335
## temp_scaled 7522.462
## clouds_scaled 6721.690
summary(fit_hier)$random
## $month_factor
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sd(Intercept) 0.2849652 0.06793572 0.1883176 0.4502614 1.000588 1902.399
## Tail_ESS
## sd(Intercept) 3011.343
plot(fit_hier) # trace plots and densities
bayesplot::mcmc_rhat(rhat(fit_hier))
We look for Rhat values close to 1 for all parameters, bulk and tail effective sample sizes large enough for stable estimates, well-mixed trace plots, and an absence of divergent transitions.
If diagnostics were poor, we would consider reparameterizing,
increasing iter, or adjusting adapt_delta. In
the runs for this project, the diagnostics were satisfactory, suggesting
that the posterior has been explored adequately.
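One further check worth running (a small sketch beyond the diagnostics shown above) is a count of divergent transitions, since divergences can bias posteriors in hierarchical models even when Rhat and ESS look fine:
# Count divergent transitions across all post-warmup draws
np <- nuts_params(fit_hier)
sum(subset(np, Parameter == "divergent__")$Value)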
We first examine the posterior summaries for the global regression coefficients.
fixef(fit_hier)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 6.25462118 0.083449494 6.08731766 6.41905992
## GHI_scaled 1.35303200 0.003489750 1.34622030 1.35988964
## temp_scaled -0.10529976 0.005219221 -0.11540306 -0.09487033
## clouds_scaled 0.05989996 0.003057899 0.05400022 0.06581612
The fitted coefficients above line up with expectations for irradiance and temperature, though not for cloud cover:
- GHI_scaled: a large positive coefficient (posterior mean ≈ 1.35) with a very narrow credible interval.
- temp_scaled: a small negative coefficient (posterior mean ≈ -0.11), with a 95% interval excluding zero.
- clouds_scaled: a small positive coefficient (posterior mean ≈ 0.06), rather than the negative effect one might expect a priori.
Because we standardized the predictors, the coefficient for GHI_scaled can be interpreted as:
The expected change in log-energy associated with a one-standard-deviation increase in GHI, holding temperature and cloud cover constant.
Exponentiating the coefficient gives a multiplicative factor on energy itself. Here the posterior mean is
\[ \beta_{\text{GHI}} \approx 1.35, \]
so
\[ \exp(1.35) \approx 3.86, \]
meaning that a one-standard-deviation increase in GHI multiplies expected energy by a factor of roughly 3.9, all else equal.
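Rather than exponentiating only the posterior mean, the full posterior of this multiplicative effect can be summarized from the draws (a short sketch; b_GHI_scaled is the name brms assigns to this coefficient):
# Posterior median and 95% interval for the multiplicative effect of +1 SD of GHI
draws <- as_draws_df(fit_hier)
quantile(exp(draws$b_GHI_scaled), probs = c(0.025, 0.5, 0.975))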
For cloud cover, the sign deserves comment. One might expect a negative coefficient, with increased cloudiness credibly associated with lower output after adjusting for irradiance and temperature. In this fit, however, the 95% credible interval for clouds_scaled lies slightly above zero, which suggests that GHI already captures most of the cloud-related loss of sunlight; the conditional effect of cloud cover given GHI need not share the sign of its marginal effect.
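The posterior probability that this coefficient is negative can be computed directly with brms' hypothesis() utility (a short sketch):
# Evidence ratio and posterior probability that the cloud coefficient is negative
hypothesis(fit_hier, "clouds_scaled < 0")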
We next look at the month-level random intercepts:
ranef(fit_hier)$month_factor
## , , Intercept
##
## Estimate Est.Error Q2.5 Q97.5
## 1 -0.16957028 0.08450064 -0.33531960 0.001369822
## 2 0.31068395 0.08416339 0.14467172 0.482170484
## 3 0.22333723 0.08398299 0.05829526 0.393368625
## 4 -0.01728495 0.08380422 -0.18357309 0.151793372
## 5 -0.21782349 0.08379834 -0.38414588 -0.049732464
## 6 -0.30395514 0.08396682 -0.47049822 -0.137400302
## 7 -0.12426871 0.08390127 -0.28995853 0.043914281
## 8 0.01648338 0.08410347 -0.14791470 0.185749931
## 9 0.30004770 0.08397231 0.13435249 0.470095332
## 10 0.42304975 0.08411682 0.25782863 0.593190008
## 11 -0.08050177 0.08430716 -0.24694560 0.090252931
## 12 -0.33210447 0.08457605 -0.49791616 -0.162368900
We can also visualize them with uncertainty intervals:
# Extract month-level random intercepts and tidy them
month_effects <- ranef(fit_hier)$month_factor[, , "Intercept"] %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "month") %>% # get month from rownames
mutate(
month = as.integer(month) # convert "1","2",... to integers
) %>%
arrange(month)
# Plot with months in the correct numeric order
ggplot(month_effects, aes(x = factor(month, levels = 1:12), y = Estimate)) +
geom_point() +
geom_errorbar(aes(ymin = Q2.5, ymax = Q97.5), width = 0.15) +
labs(
title = "Posterior Month-Level Intercepts (log scale)",
x = "Month",
y = "Month-specific intercept for log(energy_wh)"
)
Interpretation: months with positive intercepts (for example, months 2, 3, 9, and 10) tend to produce more energy than the overall baseline under the same standardized weather conditions, while months with negative intercepts (such as 5, 6, and 12) tend to produce less.
The hierarchical prior shrinks these intercepts toward a common mean, especially for months with little data. This is a core advantage of the hierarchical Bayesian approach.
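To see this shrinkage explicitly, the raw (no-pooling) month deviations in mean log-energy can be set against the partially pooled intercepts; because the raw deviations ignore the covariates, this is only a rough illustration (a sketch reusing month_effects from the plot above):
# Unpooled month deviations from the overall mean vs. partially pooled intercepts
raw_dev <- solar_day %>%
  group_by(month = as.integer(as.character(month_factor))) %>%
  summarise(raw_deviation = mean(log_energy) - mean(solar_day$log_energy)) %>%
  arrange(month)
raw_dev$pooled_intercept <- month_effects$Estimate[match(raw_dev$month, month_effects$month)]
raw_dev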
# Posterior summary of the residual standard deviation sigma
# (sigma() on the brmsfit returned numeric(0), so we read it from the model summary instead)
summary(fit_hier)$spec_pars
The residual standard deviation \(\sigma\) on the log scale tells us how much variability remains after accounting for GHI, temperature, clouds, and month-level effects. A moderate value indicates that the predictors explain a substantial portion of the variation but that inherent stochasticity and unmodeled factors still play a role.
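A related single-number summary (an addition, not in the original write-up) is the Bayesian R-squared, which brms computes from the posterior draws and which indicates the share of variance in log-energy accounted for by the predictors and month effects:
# Posterior summary of Bayesian R^2 for the fitted model
bayes_R2(fit_hier)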
Posterior predictive checks help evaluate whether the model can reproduce key features of the observed data.
pp_check(fit_hier, ndraws = 100)
This command compares the distribution of simulated log-energy values from the posterior predictive distribution against the observed log-energy distribution.
If the model is reasonable, the simulated distributions will broadly overlay the observed distribution. Large discrepancies (for example, the model systematically under-predicting high-energy intervals) would suggest the need for model refinement, such as adding interaction terms or considering non-Gaussian error structures.
Overall, the model captures the main shape and spread of log-energy reasonably well, with some deviations at the extremes that are typical in real-world energy data.
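Because the hierarchy is organized by month, a grouped check is also informative: comparing the observed mean log-energy in each month with the means of the simulated datasets (a sketch using the stat_grouped type of pp_check):
# Observed vs. simulated mean log-energy within each month
pp_check(fit_hier, type = "stat_grouped", stat = "mean",
  group = "month_factor", ndraws = 100)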
One of the advantages of a Bayesian model is that it provides a full posterior predictive distribution for future observations. To illustrate this, we consider a simple scenario:
Predict solar energy output for a daylight interval in July under moderately strong irradiance, warm temperature, and relatively low cloud cover.
Concretely, suppose GHI = 800, temperature = 25°C, cloud cover = 20%, and the month is July.
We translate these into standardized predictors using the means and standard deviations from our training data.
mean_GHI <- mean(solar_day$GHI)
sd_GHI <- sd(solar_day$GHI)
mean_temp <- mean(solar_day$temp)
sd_temp <- sd(solar_day$temp)
mean_cloud <- mean(solar_day$clouds_all)
sd_cloud <- sd(solar_day$clouds_all)
newdata <- tibble(
GHI_scaled = (800 - mean_GHI) / sd_GHI,
temp_scaled = (25 - mean_temp) / sd_temp,
clouds_scaled = (20 - mean_cloud) / sd_cloud,
month_factor = factor(7, levels = levels(solar_day$month_factor)) # July
)
newdata
## # A tibble: 1 × 4
## GHI_scaled temp_scaled clouds_scaled month_factor
## <dbl> <dbl> <dbl> <fct>
## 1 12.7 1.51 -1.24 7
We use posterior_predict() to generate draws from the
predictive distribution of log-energy, then exponentiate to obtain
predictions on the original Wh scale.
set.seed(2025)
log_pred_draws <- posterior_predict(fit_hier, newdata = newdata, ndraws = 2000)
energy_pred_draws <- exp(log_pred_draws[, 1])
quantile(energy_pred_draws, probs = c(0.05, 0.5, 0.95))
## 5% 50% 95%
## 2523837642 10965648090 47434029887
We can also visualize the predictive distribution:
pred_df <- tibble(energy = energy_pred_draws)
ggplot(pred_df, aes(x = energy)) +
geom_histogram(bins = 40, color = "black") +
labs(
title = "Posterior Predictive Distribution of Energy (Example July Scenario)",
x = "Predicted energy_wh",
y = "Count"
)
For this example scenario, we might summarize the result in plain language as:
Given the model and the data, under the assumed July conditions (GHI ≈ 800, temperature 25°C, cloud cover 20%), the expected energy output in one interval is about X Wh, with most plausible values falling between Y Wh and Z Wh.
Here, X is the posterior median and Y–Z is a chosen credible interval, such as 90% or 95%. Rather than providing a single point forecast, we convey an entire distribution of plausible outcomes, which is a key strength of Bayesian predictive analysis. An important caveat is that GHI = 800 lies far above the largest GHI observed in this dataset (about 229), so GHI_scaled ≈ 12.7 is a strong extrapolation; the resulting draws are implausibly large compared with the observed maximum of about 5,020 Wh per interval, and this scenario should be read as an illustration of the predictive machinery rather than a realistic forecast.
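For comparison, a scenario that stays inside the observed range of the predictors (a sketch, using an assumed GHI of 200, near the upper end of the observed values) should produce predictions on a scale much closer to the observed data:
# Same July scenario but with GHI inside the observed range (max ≈ 229)
newdata_inrange <- tibble(
  GHI_scaled = (200 - mean_GHI) / sd_GHI,
  temp_scaled = (25 - mean_temp) / sd_temp,
  clouds_scaled = (20 - mean_cloud) / sd_cloud,
  month_factor = factor(7, levels = levels(solar_day$month_factor))
)
log_pred_inrange <- posterior_predict(fit_hier, newdata = newdata_inrange, ndraws = 2000)
quantile(exp(log_pred_inrange[, 1]), probs = c(0.05, 0.5, 0.95))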
Using a hierarchical Bayesian regression model, we analyzed solar energy production as a function of irradiance and weather variables, with month-level random intercepts. The main findings can be summarized as follows:
Irradiance (GHI) is, as expected, the dominant driver of solar
energy output. The posterior distribution for the coefficient of
GHI_scaled is strongly positive, and its credible interval
lies well above zero. On the original scale, a one-standard-deviation
increase in GHI corresponds to a substantial multiplicative increase in
expected energy production.
Cloud cover plays a more subtle role than expected. After controlling for GHI, the estimated cloud-cover coefficient is small and slightly positive rather than negative, suggesting that measured irradiance already absorbs most of the attenuating effect of clouds; additional cloudiness appears to matter mainly through its effect on GHI itself rather than through a separate, independent penalty.
Temperature has a smaller effect. In this fit, the temperature coefficient is small and negative (posterior mean ≈ -0.11, with a 95% credible interval excluding zero), consistent with the tendency of high panel temperatures to reduce conversion efficiency. Its effect is clearly detectable but far less important than irradiance.
Month-to-month variation is captured by the hierarchical intercepts. Some months have higher baseline log-energy than others even after adjusting for instantaneous weather conditions, reflecting seasonal factors such as sun angle, typical atmospheric conditions, and perhaps operational aspects. The hierarchical prior shrinks these month-level intercepts toward a common mean, especially for months with fewer observations.
The Bayesian hierarchical framework is well-suited to this problem for several reasons:
Partial pooling across months: Rather than treating each month independently or forcing them to be identical, the model allows for structured variation and learns how similar months are to each other.
Uncertainty quantification: The posterior distributions for coefficients, month-level effects, and predictive quantities provide a clear picture of uncertainty. This is more informative than a single point estimate and helps avoid over-confidence in forecasts.
Posterior predictive checks: By simulating from the posterior predictive distribution, we can assess whether the model reproduces key features of the observed data. In this project, the checks show that the model captures the main shape and spread of log-energy reasonably well.
Interpretability: The regression coefficients are directly interpretable, especially on the log-scale (multiplicative effects on energy). The hierarchical structure is naturally explained in terms of seasonality and partial sharing of information.
Overall, Bayesian methods are both appropriate and informative for this type of renewable energy modeling.
Several limitations should be noted:
Modeling only daylight intervals: We restricted the analysis to
times when isSun == 1 and both GHI and energy output are
positive. This is sensible for studying the relationship between weather
and active production, but it excludes transitions at dawn and dusk and
does not model zero-output periods.
Simplified functional form: The model assumes a linear relationship on the log-scale between energy and the standardized predictors. In reality, the relationship between energy and irradiance, temperature, and cloud cover may be nonlinear, and there may be important interactions (for example, temperature moderating the effect of GHI).
Omitted variables: The dataset does not explicitly include all possible factors, such as panel tilt, shading, maintenance events, or grid curtailment. These unobserved influences are absorbed into the residual error term and may limit predictive accuracy.
Time dependence: The model does not explicitly account for autocorrelation over time. Solar production series can exhibit temporal dependence (e.g., cloudy periods vs. clear periods) that a time-series model or Gaussian process might capture more fully.
Future work could extend this project in several directions:
Nonlinear and interaction terms: Incorporating spline terms or interaction effects (for example, GHI × temperature) could provide a more flexible relationship between weather and energy; a brief formula sketch follows this list.
Time-series structure: Adding time-series components or Gaussian process priors could capture temporal dependence and improve short-term forecasting.
More detailed hierarchy: If data from multiple sites (different solar plants or locations) were available, the model could include site-level random effects nested within regions or climatic zones.
Mixture models: A mixture model could distinguish between different operational regimes (e.g., clear sky vs. partly cloudy, or normal operation vs. maintenance events).
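As an illustration of the first two extensions, the brms formula could be adapted along the following lines (a sketch only: the smooth term, interaction, and autoregressive component are supported by brms, but priors and settings would need care, and the model is not fit here):
# Sketch of an extended formula: a smooth in GHI, a GHI-by-temperature
# interaction, and the same month-level random intercepts (not fit here)
formula_ext <- bf(
  log_energy ~ s(GHI_scaled) + temp_scaled + clouds_scaled +
    GHI_scaled:temp_scaled + (1 | month_factor)
)
# Short-term temporal dependence could be added with an autoregressive
# term in the formula, e.g. log_energy ~ ... + ar(p = 1)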
Despite these limitations, the present hierarchical Bayesian model already demonstrates how modern Bayesian methods can be used to analyze renewable energy production, quantify uncertainty, and generate probabilistic forecasts that are directly useful for planning and decision-making.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
Kaggle. (2025). Renewable Energy and Weather Conditions [Data set]. Retrieved from https://www.kaggle.com/datasets/samanemami/renewable-energy-and-weather-conditions
This analysis and report were partially assisted by generative AI (ChatGPT, OpenAI). All modeling decisions, interpretation, and verification were reviewed and finalized by the author.